feat: add new HanLP integration with ChineseDocumentSplitter #1943
base: main
Conversation
…
- Added release note YAML file in notes/
- Reverted config.yaml
- Implemented lazy import for hanlp
- Removed main guard block from module
… and fix lint issues
DocumentSplitter keeps a source_id in the metadata. We might want to do the same in ChineseDocumentSplitter: https://github.com/deepset-ai/haystack/blob/c18f81283c97b950d14238a9d1fa266c3afaf506/haystack/components/preprocessors/document_splitter.py#L231
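A minimal sketch of how source_id could be carried over, assuming the chunks are Haystack Document objects; the helper name and the split_id field are illustrative, not part of the PR:

```python
from typing import List

from haystack import Document


def _attach_source_metadata(source_doc: Document, chunks: List[Document]) -> List[Document]:
    # Mirror DocumentSplitter's behaviour: remember which document each chunk came from.
    for split_idx, chunk in enumerate(chunks):
        chunk.meta["source_id"] = source_doc.id  # id of the original document
        chunk.meta["split_id"] = split_idx       # position of the chunk within the source
    return chunks
```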
docs = result["documents"]
assert all(isinstance(doc, Document) for doc in docs)
assert all(doc.content.strip() != "" for doc in docs)
assert any("。" in doc.content for doc in docs), "Expected at least one chunk containing a full stop."
We could check more explicitly that chunks end with 。
have a second parameter in the asserts that better explains what the issue is if the assert fails; self-documenting for the reader
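For instance, the asserts from the snippet above could be tightened and given messages along these lines (a sketch; the messages are illustrative):

```python
assert all(
    doc.content.strip() != "" for doc in docs
), "No chunk should be empty after stripping whitespace."
assert all(
    doc.content.strip().endswith("。") for doc in docs
), "Every chunk should end with a full stop (。), otherwise a sentence was split mid-way."
```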
)
splitter.warm_up()
result = splitter.run(documents=[doc])
docs = result["documents"]
check that there are exactly three documents
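For example (the message is illustrative):

```python
assert len(docs) == 3, f"Expected exactly 3 chunks, got {len(docs)}"
```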
splitter.warm_up()
result = splitter.run(documents=[doc])
docs = result["documents"]
assert all(doc.content.strip().endswith(("。", "!", "?")) for doc in docs), "Sentence was cut off!"
change the test to: if the document contains 。, then 。 must be the final character of the document.
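A sketch of the suggested condition:

```python
# Only chunks that contain 。 are checked; for those, 。 must be the final character.
for doc in docs:
    content = doc.content.strip()
    if "。" in content:
        assert content.endswith("。"), f"Sentence was cut off in chunk: {content!r}"
```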
split_threshold: int = 0,
respect_sentence_boundary: bool = False,
splitting_function: Optional[Callable] = None,
particle_size: Literal["coarse", "fine"] = "coarse",
We could discuss other names such as granularity or segmentation_granularity.
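For illustration, the signature from the diff with the parameter renamed to granularity (only the name changes; everything else is as above):

```python
from typing import Callable, Literal, Optional


def __init__(
    self,
    split_threshold: int = 0,
    respect_sentence_boundary: bool = False,
    splitting_function: Optional[Callable] = None,
    granularity: Literal["coarse", "fine"] = "coarse",  # was: particle_size
):
    ...
```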
return text_splits, split_start_page_numbers, split_start_indices

def _split_by_hanlp_sentence(self, doc: Document) -> List[Document]:
we also call this when split_by is set to word, so let's rename this function
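Sketched, with the new name being only a suggestion:

```python
from typing import List

from haystack import Document


# Serves both split_by="sentence" and split_by="word", so a neutral name
# avoids implying sentence-only splitting.
def _split_by_hanlp(self, doc: Document) -> List[Document]:
    ...
```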
for sentence_idx, sentence in enumerate(sentences):
    current_chunk.append(sentence)
    if particle_size in {"coarse", "fine"}:
always true
num_words = 0

for sent in reversed(sentences[1:]):
    if particle_size in {"coarse", "fine"}:
always true
# 'fine' represents fine granularity word segmentation,
# default is coarse granularity word segmentation

if self.particle_size in {"coarse", "fine"}:
always true
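One way to drop these three always-true checks is to validate particle_size once at construction time, sketched below (the class name is a placeholder):

```python
from typing import Literal


class ChineseDocumentSplitterSketch:
    def __init__(self, particle_size: Literal["coarse", "fine"] = "coarse") -> None:
        # Validate once here so call sites no longer need
        # `if particle_size in {"coarse", "fine"}`, which can never be False.
        if particle_size not in ("coarse", "fine"):
            raise ValueError(f"particle_size must be 'coarse' or 'fine', got {particle_size!r}")
        self.particle_size = particle_size
```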
Related Issues
Proposed Changes:
A warm_up method loads the models, support for the English language is removed, and ChineseDocumentSplitter no longer inherits from DocumentSplitter.
How did you test it?
We should test with this notebook. It shows how to use the new component in the forked repository: https://github.com/mc112611/haystack/blob/307f8340b2e1a9104efe4e33d8c1885d17143c36/examples/chinese_RAG_test_haystack_chinese.ipynb
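For a quick local check, a minimal usage sketch; the import path and the split_by parameter are assumptions based on the diff snippets above and on DocumentSplitter's conventions, and may differ from the actual PR:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter  # assumed path

splitter = ChineseDocumentSplitter(split_by="sentence", particle_size="coarse")  # split_by is assumed
splitter.warm_up()  # loads the HanLP models

doc = Document(content="这是第一句话。这是第二句话!这是第三句话?")
result = splitter.run(documents=[doc])
for chunk in result["documents"]:
    print(chunk.content)
```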
Notes for the reviewer
Before this can be reviewed, we need to work on:
self.language == english
I had a look at the other tokenizers that HanLP supports. All of them seem to be worse than the two tokenizers that we support in this integration. Therefore, I'd limit the user's options to just the two. https://hanlp.hankcs.com/docs/api/hanlp/pretrained/tok.html
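For context, the two tokenizers can be loaded directly from HanLP's pretrained models (constant names taken from the linked docs):

```python
import hanlp

# Coarse- and fine-grained Chinese word segmentation models.
coarse_tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
fine_tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)

print(coarse_tok("商品和服务项目"))  # coarse segmentation keeps longer compounds together
print(fine_tok("商品和服务项目"))    # fine segmentation splits into smaller units
```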
Checklist
The PR title follows one of these prefixes: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.